---
title: Apify Dataset
---

This guide shows how to use [Apify](https://apify.com) with LangChain to load documents from an Apify Dataset.

## Overview

[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than two thousand
ready-made apps called _Actors_ for various web scraping, crawling, and data extraction use cases.

This guide shows how to load documents
from an [Apify Dataset](https://docs.apify.com/platform/storage/dataset) — a scalable append-only
storage built for storing structured web scraping results,
such as a list of products or Google SERPs, and then export them to various
formats like JSON, CSV, or Excel.

Datasets are typically used to save results of different Actors.
For example, [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor
deeply crawls websites such as documentation, knowledge bases, help centers, or blogs,
and then stores the text content of webpages into a dataset,
from which you can feed the documents into a vector database and use it for information retrieval.
Another example is the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor,
which queries Google Search, scrapes the top N pages from the results, and returns the cleaned
content in Markdown format for further processing by a large language model.

## Setup

You'll first need to install the official Apify client:

```bash npm
npm install apify-client
```
import IntegrationInstallTooltip from '/snippets/javascript-integrations/integration-install-tooltip.mdx';

<IntegrationInstallTooltip/>

```bash npm
npm install hnswlib-node @langchain/openai @langchain/community @langchain/core
```

You'll also need to sign up and retrieve your [Apify API token](https://console.apify.com/settings/integrations).

## Usage

### From a New Dataset (Crawl a Website and Store the data in Apify Dataset)

If you don't already have an existing dataset on the Apify platform, you'll need to initialize the document loader by calling an Actor and waiting for the results.
In the example below, we use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl
LangChain documentation, store the results in Apify Dataset, and then load the dataset using the `ApifyDatasetLoader`.
For this demonstration, we'll use a fast Cheerio crawler type and limit the number of crawled pages to 10.

**Note:** Running the Website Content Crawler may take some time, depending on the size of the website. For large sites, it can take several hours or even days!

Here's an example:

import ApifyDatasetNew from "/snippets/javascript-integrations/examples/document_loaders/apify_dataset_new.mdx";

<ApifyDatasetNew />

## From an Existing Dataset

If you've already run an Actor and have an existing dataset on the Apify platform, you can initialize the document loader directly using the constructor

import ApifyDatasetExisting from "/snippets/javascript-integrations/examples/document_loaders/apify_dataset_existing.mdx";

<ApifyDatasetExisting />
